13 research outputs found

    Adaptive load balancing for HPC applications

    Get PDF
    One of the critical factors affecting the performance of many applications is load imbalance. Applications are becoming increasingly sophisticated, using irregular structures and adaptive refinement techniques that result in load imbalance. Moreover, systems are becoming more complex: the number of cores per node is increasing substantially and nodes are becoming heterogeneous. High variability in the performance of hardware components introduces further imbalance. Load imbalance leads to a drop in system utilization and degrades performance. To address this problem, many HPC applications employ dynamic load balancing algorithms to redistribute work and balance the load. Performing load balancing is therefore necessary to achieve high performance. Different application characteristics warrant different load balancing strategies, so a variety of high-quality, scalable load balancing algorithms are needed to cater to different applications. However, choosing an appropriate load balancer is not sufficient for good performance, because performing load balancing incurs a cost. Moreover, due to the dynamic nature of the application, it is hard to decide when to perform load balancing. Deciding when to load balance and which strategy to use may therefore not be possible a priori. With ever-increasing core counts per node, there will be a vast amount of on-node parallelism, and load imbalance arising at the node level can be mitigated within the node instead of performing a global load balancing step. However, having the application developer manage resources and handle dynamic imbalance is inefficient and places a burden on the programmer. The focus of this dissertation is on developing scalable and adaptive techniques for handling load imbalance. The dissertation presents different load balancing algorithms for handling inter- and intra-node load imbalance. It also presents an introspective runtime system that monitors application and system characteristics and makes load balancing decisions automatically.
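
    For reference, the load-imbalance metric commonly used in this line of work (a reasonable stand-in for what such an introspective runtime would measure, though not necessarily the dissertation's exact formulation) compares the maximum per-core load to the average:

```cpp
// Minimal sketch: lambda = max_load / avg_load - 1. A value of 0 means the
// load is perfectly balanced; larger values indicate worse imbalance.
// Assumes a non-empty vector of per-core load measurements.
#include <algorithm>
#include <numeric>
#include <vector>

double imbalance(const std::vector<double>& per_core_load) {
    double max_load = *std::max_element(per_core_load.begin(), per_core_load.end());
    double avg_load = std::accumulate(per_core_load.begin(), per_core_load.end(), 0.0)
                      / per_core_load.size();
    return max_load / avg_load - 1.0;
}
```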

    HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU

    Full text link
    The end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated on state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today. This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x for LULESH on AMD, 1.57x on NVIDIA) with minimal quality loss (0.1%). Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures.
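
    As a rough illustration of the pragma-based style described above, the sketch below shows a standard OpenMP offload region annotated with a hypothetical approximation directive. The `approx`/`perfo` spelling is an assumption for illustration only (left as a comment), not HPAC-Offload's documented syntax; the OpenMP pragmas themselves are standard.

```cpp
// Sketch only: the "#pragma approx ... perfo(...)" line is a hypothetical
// approximation directive in the spirit of HPAC-Offload, not its exact grammar.
// The surrounding OpenMP pragma is standard OpenMP target offload.
#include <cstddef>
#include <vector>

void scale(std::vector<float>& x, float a) {
    float* p = x.data();
    std::size_t n = x.size();

    // Hypothetical directive: skip a fraction of loop iterations (perforation),
    // trading output quality for speed on the GPU.
    // #pragma approx perfo(0.1)
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= a;
}
```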

    Fast And Automatic Floating Point Error Analysis With CHEF-FP

    Full text link
    As we reach the limit of Moore's Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results to trade off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to stringent accuracy requirements. We need tools and techniques to identify regions of code that are amenable to approximation and to assess their impact on application output quality, so that developers can apply approximation selectively. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analyzing approximation errors in HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler on top of the LLVM compiler infrastructure, as a backend, and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations, resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source-code-transformation-based AD tools to perform FP error analysis. This paper primarily focuses on analyzing errors introduced by mixed-precision AC techniques. We also show the applicability of our tool in estimating other kinds of errors by evaluating it on codes that use approximate functions. Moreover, we demonstrate the speedups CHEF-FP achieves in analysis time compared to the existing state-of-the-art tool, owing to its ability to generate and insert approximation error estimation code directly into the derivative source.
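
    To make the underlying idea concrete, here is a minimal, hand-written sketch of the first-order error estimate that an AD-based analysis can automate: each lowered-precision input contributes roughly |df/dx| * dx to the output error. The toy function and derivatives below are illustrative and are not code generated by CHEF-FP or Clad.

```cpp
#include <cmath>
#include <cstdio>

// Toy model f(x, y) = x * y + sin(x).
// First-order error estimate: |df/dx| * dx + |df/dy| * dy, where dx and dy
// bound the absolute error introduced by storing x and y in lower precision.
double error_estimate(double x, double y, double dx, double dy) {
    double dfdx = y + std::cos(x);  // hand-written derivative w.r.t. x
    double dfdy = x;                // hand-written derivative w.r.t. y
    return std::fabs(dfdx) * dx + std::fabs(dfdy) * dy;
}

int main() {
    double x = 1.5, y = 2.0;
    // Rough bound on the representation error when demoting each input to float.
    double dx = std::fabs(x - static_cast<double>(static_cast<float>(x)));
    double dy = std::fabs(y - static_cast<double>(static_cast<float>(y)));
    std::printf("estimated |error| <= %.3e\n", error_estimate(x, y, dx, dy));
    return 0;
}
```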

    Power, Reliability, Performance: One System to Rule Them All

    Get PDF
    In a design based on the Charm++ parallel programming framework, an adaptive runtime system dynamically interacts with a data center's resource manager to control power through intelligent job scheduling, resource reallocation, and hardware reconfiguration. It simultaneously manages reliability by cooling the system to the optimal level for the running application, and it maintains performance through load balancing.

    Applying Graph Partitioning Methods in Measurement-based Dynamic Load Balancing

    Get PDF
    Load imbalance in an application can lead to degradation of performance and a significant drop in system utilization. Achieving the best parallel efficiency for a program requires optimal load balancing, which is an NP-hard problem. This paper explores the use of graph partitioning algorithms, traditionally used for partitioning physical domains/meshes, for measurement-based dynamic load balancing of parallel applications. In particular, we present repartitioning methods that consider the previous mapping to minimize dynamic migration costs. We also discuss the use of a greedy algorithm in conjunction with iterative graph partitioning algorithms to reduce the load imbalance for graphs with heavily skewed load distributions. These algorithms are implemented in a graph partitioning toolbox called SCOTCH, and we use CHARM++, a migratable-objects-based programming model, to experiment with various load balancing scenarios. To compare different load balancing strategies based on graph partitioners, we have implemented METIS- and ZOLTAN-based load balancers in CHARM++. We demonstrate the effectiveness of the new algorithms developed in SCOTCH in the context of the NAS BT solver and two micro-benchmarks. We show that SCOTCH-based strategies lead to better performance than other existing partitioners, both in terms of application execution time and the number of objects migrated.
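
    For intuition about the greedy step used alongside the graph partitioners, the sketch below implements the classic longest-processing-time greedy assignment: heaviest objects first, each placed on the currently least loaded processor. It deliberately ignores the communication graph and migration costs that the SCOTCH-based strategies also account for, and it is not the paper's actual algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns an object -> processor assignment for `nprocs` processors,
// given the measured load of each object.
std::vector<int> greedy_assign(const std::vector<double>& loads, int nprocs) {
    // Sort object indices by decreasing load.
    std::vector<int> order(loads.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return loads[a] > loads[b]; });

    // Min-heap of (current processor load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int p = 0; p < nprocs; ++p) heap.emplace(0.0, p);

    std::vector<int> assignment(loads.size());
    for (int obj : order) {
        auto [load, proc] = heap.top();   // least loaded processor so far
        heap.pop();
        assignment[obj] = proc;
        heap.emplace(load + loads[obj], proc);
    }
    return assignment;
}
```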

    Parallel Programming with Migratable Objects: Charm++ in Practice

    Get PDF
    The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented in this paper spans many mini-applications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.

    Meta-Balancer: automated load balancing based on application behavior

    Get PDF
    With the dawn of petascale, and with exascale in the near future, it has become significantly more difficult to write parallel applications that fully exploit the available processing power and scale to large systems. Load imbalance, whether computation- or communication-induced, is one of the key challenges in achieving scalability and high performance. Problem sizes and system sizes have become so large that manually handling the imbalance in dynamic applications and finding an optimal distribution of load has become a herculean task. Charm++ provides the user with a runtime system that performs dynamic load balancing. To enable Charm++ to perform load balancing efficiently, the user makes certain decisions, such as when to load balance and which strategy to use, and informs the Charm++ runtime system of these decisions. Often, making these important decisions involves hand-tuning each application by observing various runs of the application. In this thesis, a Meta-Balancer, which relieves the user from the effort of making these load balancing decisions, is presented. The Meta-Balancer is part of the Charm++ load balancing framework. It identifies the characteristics of the application and, based on the principle of persistence and the accrued information, makes load balancing decisions. We study the performance of the Meta-Balancer in the context of the leanmd mini-application. We also evaluate the Meta-Balancer in the context of micro-benchmarks such as kNeighbor and jacobi2D. We also present several new load balancing strategies that have been incorporated into Charm++ and study their impact on application performance. These new strategies are: 1) RefineSwapLB, a refinement-based load balancing strategy; 2) CommAwareRefineLB, a communication-aware refinement strategy; 3) ScotchRefineLB, a refinement-based graph partitioning strategy using Scotch, a graph partitioner; and 4) ZoltanLB, a multicast-aware load balancing strategy using Zoltan, a hypergraph partitioner.
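
    To illustrate what a refinement-based strategy does, the following sketch repeatedly migrates one object from the most loaded to the least loaded processor while doing so still reduces the maximum load. It is a simplified stand-in for the idea, not the actual RefineSwapLB implementation in Charm++.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// proc_objs[p] holds the loads of the objects currently mapped to processor p.
// Illustrative refinement step: move one object at a time from the heaviest to
// the lightest processor as long as the move strictly improves the balance.
void refine(std::vector<std::vector<double>>& proc_objs) {
    auto load_of = [](const std::vector<double>& objs) {
        return std::accumulate(objs.begin(), objs.end(), 0.0);
    };
    while (true) {
        auto heaviest = std::max_element(proc_objs.begin(), proc_objs.end(),
            [&](const auto& a, const auto& b) { return load_of(a) < load_of(b); });
        auto lightest = std::min_element(proc_objs.begin(), proc_objs.end(),
            [&](const auto& a, const auto& b) { return load_of(a) < load_of(b); });
        if (heaviest == lightest || heaviest->empty()) break;

        double obj = heaviest->back();
        // Stop once migrating the object would no longer reduce the maximum load.
        if (load_of(*lightest) + obj >= load_of(*heaviest)) break;
        lightest->push_back(obj);   // migrate one object
        heaviest->pop_back();
    }
}
```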

    POSTER

    No full text

    Automated Load Balancing Invocation based on Application Characteristics

    No full text
    Performance of applications executed on large parallel systems suffers due to load imbalance. Load balancing is required to scale such applications to large systems. However, performing load balancing incurs a cost that may not be known a priori. In addition, application characteristics may change due to the application's dynamic nature and the parallel system used for execution. As a result, deciding when to balance the load to obtain the best performance is challenging. Existing approaches put this burden on the users, who rely on educated guesses and extrapolation techniques to decide on a reasonable load balancing period, which may be neither feasible nor efficient. In this paper, we propose the Meta-Balancer framework, which relieves application programmers of deciding when to balance load. By continuously monitoring application characteristics and using a set of guiding principles, Meta-Balancer invokes load balancing on its own without any prior application knowledge. We demonstrate that Meta-Balancer improves or matches the best performance that can be obtained by fine-tuning periodic load balancing. We also show that in some cases Meta-Balancer improves performance by 18%, whereas periodic load balancing gives only a 1.5% benefit. Keywords: load balancing, automated, parallel, simulation
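
    A minimal sketch of the kind of decision rule such a framework can apply (an illustration of the general principle, not Meta-Balancer's actual logic): accumulate the time lost to imbalance each iteration, and trigger load balancing once that accumulated loss exceeds the measured cost of one load-balancing step.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Decides whether to invoke load balancing, given per-processor times for the
// last iteration, the measured cost of one load-balancing step, and the time
// lost to imbalance accumulated since the last rebalance (updated in place).
// Purely illustrative; not the actual Meta-Balancer decision algorithm.
bool should_balance(const std::vector<double>& iter_times,
                    double lb_cost,
                    double& accumulated_loss) {
    double max_t = *std::max_element(iter_times.begin(), iter_times.end());
    double avg_t = std::accumulate(iter_times.begin(), iter_times.end(), 0.0)
                   / iter_times.size();
    accumulated_loss += max_t - avg_t;   // time lost to imbalance this iteration
    if (accumulated_loss > lb_cost) {    // rebalancing now pays for itself
        accumulated_loss = 0.0;
        return true;
    }
    return false;
}
```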